The authors describe a class of directory coherence protocols called delta coherence protocols. These protocols use network guarantees to support a new and highly concurrent approach to maintaining a consistent shared memory.


Delta Coherence Protocols


Caching data can reduce access latency and improve data availability, but it also raises the problem of how to maintain consistency among copies of writable data. The problem appears in different guises in different contexts: it appears as the cache coherence problem in multiprocessors, as the problem of maintaining a distributed shared memory (DSM) in distributed computations, and as the replica control problem in distributed databases. This article describes the home update protocol, a member of the class of coherence protocols called delta coherence protocols that uses isotach guarantees¹ to solve the coherence problem in a new and highly concurrent way. (Due to space constraints, and to avoid obscuring the basic concept of the protocols, we describe the protocol at a high level and do not address practical implementation issues.) Our goal is to show how isotach guarantees are useful in solving the coherence problem and in reasoning about coherence protocols.

The coherence problem is difficult, because it requires coordinating events across nodes. The traditional approach to the problem is to reduce the coordination required by limiting concurrency or weakening the correctness criteria. Hardware-based coherence protocols are traditionally divided into two classes:² snoopy protocols, which require a shared bus, and directory protocols, intended for point-to-point networks. A shared bus serializes memory requests. This serialization readily yields an agreed total order among requests, but it limits concurrency and scalability. Directory protocols are more scalable, but existing directory protocols that enforce sequential consistency (SC) require that nodes execute requests one at a time and invalidate or lock copies while executing write requests.

Delta protocols use isotach guarantees to coordinate accesses, an approach that lets delta protocols enforce SC without limiting concurrency. However, delta protocols require isotach guarantees. Whether delta coherence protocols outperform existing protocols depends on the cost of implementing isotach guarantees and on the extent to which applications can take advantage of the high level of concurrency delta protocols offer.

Isotach  systems 

An isotach (Greek translation: iso, same; tach, speed) system implements a logical time system¹ in which all messages appear to travel at the same speed: one unit of logical distance per unit of logical time. Given this property, called the isotach invariant, a processor can control the logical receive time of a message it sends by controlling the logical send time.

Neighboring nodes in an isotach system exchange signals called tokens to implement a distributed logical clock. The pulse at a processor is the number of tokens the processor has received. An isotach logical time is a lexicographically ordered three-tuple in which the first and most significant component is the pulse at the processor where the event occurs. The remaining two components, the process identifier (pid) and rank, are tie-breakers used to order send and receive events that occur in the same pulse. The sender pid orders events with identical pulse components. The rank, or issue order, orders events with identical pulse and pid components.

Figure 1. Delta coherence protocol terms and notation.

For copy c:
  $\delta(c)$: the delta of c. In the home update protocol, $\delta(c) = dist(home, c)$.

For operation op executed on copy c:
  $t_s(op)$: send time of op.
  $t_r(op)$: receive time of op, related to the send time by the isotach invariant: $t_r(op) = t_s(op) + d(op)$.
  $d(op)$: logical distance of op.
  $t_x(op)$: execution time of op; $t_x(op) = t_r(op)$, by assumption.
  $t_{effx}(op)$: effective execution time of op; $t_{effx}(op) = t_x(op) - \delta(c)$.
  $xdist(op)$: execution distance of op; $xdist(op) = t_{effx}(op) - t_s(op) = d(op) - \delta(c)$.

The isotach logical time system extends Leslie Lamport’s logical time system³ by guaranteeing that send and receive times are consistent with the isotach invariant: each message travels one unit of logical distance per pulse of logical time. Isotach systems can implement a variety of distance metrics.⁴ Here, $dist(p, p')$, the logical distance from node p to node p', is the routing distance from p to p', that is, the number of switches traversed by a message that p sends to p'. For any message m that p sends to p', $d(m)$, the logical distance message m travels, is $dist(p, p')$. For simplicity, we assume distances are static. Distances can be asymmetric; that is, $dist(p, p')$ does not necessarily equal $dist(p', p)$. By the isotach invariant, for any message m, m’s logical receive time is exactly $d(m)$ pulses after m’s logical send time, so $t_r(m) = t_s(m) + d(m)$. (The scalar quantity $d(m)$ is added to the tuple $t_s(m)$ by adding $d(m)$ to the tuple’s pulse component.) We assume processors execute messages in receive order. Thus, for any message m, $t_x(m)$, m’s logical execution time, equals $t_r(m)$. This assumption is for simplicity and is stronger than necessary.⁵
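The three-tuple logical time and the isotach invariant can be illustrated with a minimal Python sketch. The class name `IsotachTime` and the helper `receive_time` are hypothetical names chosen for this illustration; the only behavior taken from the text is the lexicographic (pulse, pid, rank) ordering and the rule that a scalar distance is added to the pulse component.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class IsotachTime:
    """Logical time compared lexicographically as (pulse, pid, rank)."""
    pulse: int
    pid: int
    rank: int

    def plus(self, distance: int) -> "IsotachTime":
        # A scalar distance is added to the pulse component only;
        # pid and rank are carried along unchanged.
        return IsotachTime(self.pulse + distance, self.pid, self.rank)


def receive_time(send_time: IsotachTime, distance: int) -> IsotachTime:
    # Isotach invariant: t_r(m) = t_s(m) + d(m).
    return send_time.plus(distance)


t_s = IsotachTime(pulse=10, pid=2, rank=0)
t_r = receive_time(t_s, distance=3)
```

Because `order=True` compares fields in declaration order, two events in the same pulse are ordered by pid and then by rank, which is the tie-breaking role the text assigns to those components.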

Most delta protocols require an isotach system that supports predictable responses. A predictable response is a message m' sent in response to another message m such that we can predict the send time of m' from the receive time of m: $t_s(m') = t_r(m) + c$. For simplicity, we assume c is 0. (In any practical system, c is a small tunable system constant, greater than zero.) Given the isotach invariant and knowledge of the logical distances involved, we can predict the receive time of m' from the send time of m: $t_r(m') = t_s(m) + d(m) + d(m')$. A predictable response inherits the original message’s pid and rank components.
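The receive time of a predictable response follows from applying the isotach invariant twice. The short derivation below writes m for the original message and m' for its response, and keeps the response constant c explicit before setting it to zero as the text assumes.

```latex
\begin{align*}
t_r(m') &= t_s(m') + d(m')            && \text{isotach invariant applied to } m' \\
        &= t_r(m) + c + d(m')         && \text{predictable response: } t_s(m') = t_r(m) + c \\
        &= t_s(m) + d(m) + c + d(m')  && \text{isotach invariant applied to } m \\
        &= t_s(m) + d(m) + d(m')      && \text{with } c = 0
\end{align*}
```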

Each processor has a switch interface unit (SIU) that tracks logical time and acts as the interface between applications and the isotach system. An application can assume that the isotach system will appear to execute its messages in the order issued. Given the isotach invariant and the assumption that the system will execute messages in the order received, an SIU can control the relative order in which locally issued messages appear to be executed. In particular, an SIU can ensure that a batch of locally issued messages appear to be executed at the same time by sending the messages so that the destinations receive the messages in the same logical pulse. An SIU can also ensure that messages issued in a sequence appear to be executed in that sequence by sending the messages so that the destinations receive the messages in nondecreasing pulses.
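As a rough illustration of how an SIU might make a batch arrive in a single pulse, the sketch below picks a common receive pulse and subtracts each destination's logical distance to get its send pulse. The function name and the dictionary of distances are assumptions made for the example; only pulse components are modeled, and choosing the earliest feasible receive pulse is one possible policy, not something the text prescribes.

```python
def schedule_same_pulse(now: int, distances: dict[str, int]) -> dict[str, int]:
    """Pick send pulses so every message in a batch is received in one pulse."""
    # Earliest common receive pulse: the farthest destination, sent right now.
    receive_pulse = now + max(distances.values())
    # By the isotach invariant, t_s = t_r - d for each destination.
    return {dest: receive_pulse - d for dest, d in distances.items()}


distances = {"A": 2, "B": 5, "C": 1}
sends = schedule_same_pulse(now=100, distances=distances)
```

Here the message to the farthest node B leaves immediately, while the messages to A and C are held back so that all three are received in pulse 105.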

Isotach systems can be implemented using the isonet algorithm, in which network switches route messages in logical time order.¹ Alternatively, an implementation can shift the work of ordering messages to the SIUs to permit the use of commodity switches. The Isotach Project at the University of Virginia has implemented a prototype system based on this approach on a cluster of commodity PCs connected with Myrinet.⁶ Both algorithms are scalable, requiring the exchange of tokens only among nearest neighbors. In the prototype, which implements isotach functionality in software, the roundtrip, user-to-user-level latency of isotach messages is on the order of 50 μsec, about twice that of nonisotach messages on the same hardware.⁷ To further reduce the cost of maintaining isotach guarantees, we are redesigning the messaging-layer software and building a second-generation prototype with custom SIUs.

Model 

The coherence problem occurs in several contexts, each with its own terminology. The terms used here are from the literature on cache coherence. We rely on the reader interested in DSM or replica control to make the appropriate translations.

We consider a system consisting of multiple processors connected to a memory system. The memory system encapsulates the representation of shared memory and the procedures for accessing it. The processor-memory system interface is as follows:

• Processors issue read and write requests to the memory system. A read request (READ) on variable v instructs the memory system to return the value of v; a write request (WRITE) on variable v instructs the memory system to assign a specified value to v. A variable is shared if more than one processor can issue requests on it. We consider only shared variables.

• The memory system returns a value in response to each READ.

Internal details of the memory system are not visible to the processors.

A memory system consists of interconnected memories and controllers programmed to execute a coherence protocol. The memory space is partitioned across the memory modules (MMs). Each processor has a cache memory and cache controller (CC), which manages the cache and translates locally issued requests into operations. An operation reads, writes, creates, or destroys a copy of a variable. The CC generates one or more write operations (writes) for each WRITE and a single read operation (read) for each READ. The phrase “the execution of request R on copy c” means “the execution of the operation resulting from R that is executed on copy c.” In a delta protocol, the CC also acts as the SIU; that is, it tracks logical time and controls the logical send times of locally issued operations.

Sidebar: Related work

Coherence protocols are notoriously hard to prove correct. Logical time has provided a formal yet intuitive basis for reasoning about coherence¹ and has been widely used in reasoning about protocols that enforce relaxed consistency semantics.

Message-ordering guarantees have been used elsewhere to increase coherence-protocol concurrency. Race-free networks² enforce sequential consistency with fewer restrictions on concurrency by restricting the network topology and the paths taken by updates. Multicast-snooping protocols³ use isotach guarantees to eliminate acknowledgment messages. Protocols can use totally ordered multicasts to ensure that writes execute atomically and can use causally ordered messages to ensure execution is consistent with program order.⁴

For each variable v, the primary copy, called the home copy, is located in an MM. The MM containing v’s home copy is v’s home. Secondary copies, called cache copies, are located in the cache memories. In a static copy set protocol, the number and locations of cache copies are determined statically. In a dynamic copy set protocol, the memory system can create and destroy cache copies. A request for v is a hit if a copy of v is in the issuing processor’s cache; otherwise it is a miss.

In a delta protocol, the memory system sends each operation as an isotach message. The logical distance and the send, receive, and execution times of an operation are those of the message carrying the operation. An operation on the local cache copy is sent as a self-message, an isotach message that a processor sends to itself. Because self-messages do not enter the network, for any self-message m, $d(m) = 0$ and $t_r(m) = t_s(m)$. Figure 1 summarizes terms relevant to operations in a delta protocol.

Each copy c in a delta protocol has a delta, denoted $\delta(c)$. In the home update protocol, $\delta(c) = dist(home, c)$, the number of logical time pulses required to propagate an update and thus the number of pulses by which c lags behind the home copy. The delta of a home copy is zero. For any operation op on c, $t_{effx}(op)$, op’s effective execution time, is $t_x(op) - \delta(c)$. Informally, $t_{effx}(op)$ is op’s apparent execution time: its logical execution time adjusted to compensate for its delta. The execution distance of op, $xdist(op)$, is defined as $t_{effx}(op) - t_s(op)$. Thus, $xdist(op) = d(op) - \delta(c)$, where c is the copy on which the memory system executes op.
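The definitions of effective execution time and execution distance reduce to simple arithmetic on pulse components. The sketch below is illustrative only: the function names are hypothetical, pid and rank are ignored, and the distances 4 and 3 are made-up values for dist(p, home) and dist(home, p).

```python
def effective_execution_pulse(send: int, distance: int, delta: int) -> int:
    # t_x = t_r = t_s + d (operations execute on receipt); t_effx = t_x - delta.
    return send + distance - delta


def xdist(distance: int, delta: int) -> int:
    # Execution distance: t_effx - t_s = d - delta.
    return distance - delta


# A write travels dist(p, home) = 4 to the home copy, whose delta is zero.
write_effx = effective_execution_pulse(send=10, distance=4, delta=0)

# A read hit is a self-message (d = 0) on a cache copy with
# delta = dist(home, p) = 3, so its execution distance is negative.
read_effx = effective_execution_pulse(send=17, distance=0, delta=3)
```

A negative execution distance means a read on a cache copy appears to execute earlier than the pulse in which it is sent, which reflects the copy lagging the home copy by its delta.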

Correctness  criteria 

A coherence protocol’s most basic task is to make replication transparent to the processors (see the “Related work” sidebar). The result of any execution should be as if the requests of the processors were executed on a single-copy memory, that is, on a memory containing a single copy of each variable. Coherence protocols can enforce the following ordering properties:

• Sequential consistency. A memory system enforces SC if “[t]he result of any execution is as if the [requests] of all the processors were executed in some sequential order, and the [requests] of each individual processor appear in this sequence in the order specified by the program.”⁸ The execution shown in Figure 2a violates SC, because no sequential ordering of the requests can produce the results shown. In Figure 2b, even though P1’s accesses are executed out of order, the execution shown is SC, because it produces the same results as a sequential execution in which P1’s accesses are executed in program order followed by P2’s accesses in program order.

• Atomicity. Requests issued as part of the same transaction or atomic action are executed so that they appear to be executed as an indivisible unit. Thus, the result of any execution is as if the requests of all the processors were executed in some sequential order and the requests in each transaction appear as a contiguous subsequence, not interleaved with other requests.

We use the term atomicity to mean consistency (not failure) atomicity; the guarantee is about the relative order in which requests appear to be executed, not about the results of a failure. A protocol that enforces failure atomicity ensures that all requests in the same transaction are executed on an all-or-nothing basis. A fault-free system is normally assumed in the context of multiprocessor cache coherence, and fault tolerance is often left to separate mechanisms in the DSM context. The isotach prototype uses a sender-based protocol and a reliable network (Myrinet) to achieve reliable communication. An unreliable network would require using a commit protocol.

Figure 2. (a) Violating sequential consistency; (b) even though P1’s accesses are executed out of order, the execution is SC. In both panels, initially v = w = 0, P1 issues WRITE(v) then WRITE(w), and P2 issues READ(w) then READ(v). The listing accompanying the figure reads: P1:: write(v)1; write(w)1 and P2:: read(w)1; read(v)0.
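The claim that no sequential ordering can produce the results in Figure 2a can be checked by brute force. The sketch below is a minimal two-processor checker written for this illustration, not part of any protocol: it enumerates every interleaving that preserves program order and replays it on a single-copy memory. It assumes the outcome listed with the figure, in which READ(w) returns 1 and READ(v) returns 0.

```python
def interleavings(a, b):
    """Yield every merge of lists a and b that preserves each list's order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def is_sequentially_consistent(p1, p2, initial=0):
    """True if some program-order-preserving interleaving explains all reads."""
    for order in interleavings(p1, p2):
        memory = {}
        explained = True
        for kind, var, value in order:
            if kind == "W":
                memory[var] = value
            elif memory.get(var, initial) != value:
                explained = False  # this read could not return `value` here
                break
        if explained:
            return True
    return False


p1 = [("W", "v", 1), ("W", "w", 1)]
outcome_a = [("R", "w", 1), ("R", "v", 0)]   # assumed Figure 2a outcome
outcome_ok = [("R", "w", 1), ("R", "v", 1)]  # an outcome that is SC

violates = not is_sequentially_consistent(p1, outcome_a)
allowed = is_sequentially_consistent(p1, outcome_ok)
```

For the assumed Figure 2a outcome, any order in which READ(w) sees 1 must place WRITE(v) before READ(v), so READ(v) cannot return 0 and the checker finds no witness.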


With a few exceptions, cache coherence protocols for multiprocessors and DSM protocols focus on SC (or a weaker variant) and leave the task of enforcing atomicity to separate mechanisms. On the other hand, databases focus on enforcing atomicity. The high cost of enforcing SC and atomicity has led to extensive exploration of weaker memory consistency models. Whether the resulting improvement in performance justifies the more complex memory interface is an undecided issue.⁹ Delta protocols enforce atomicity and SC using isotach ordering guarantees without the locks and restrictions on pipelining required in conventional systems. Thus, delta protocols represent an alternative to weakening the guarantees the memory system offers.

Home update delta coherence protocol

The home update protocol is the simplest of the delta protocols and serves as the basis for the other delta protocols, which include invalidate as well as update protocols. As its name indicates, it is an update protocol in which the home is responsible for distributing updates. The protocol is a directory protocol: for any variable v, the home for v stores a directory that records the set of processors with cache copies of v. For simplicity, we assume a bit-vector representation for the directory: bit i in the bit vector for v is set if and only if processor i has a cache copy of v. (A system could use any of several tagged directory schemes for improving the scalability of directories instead.) We describe both a static copyset version of the protocol (static protocol) and a dynamic copyset version (dynamic protocol).

Executing  requests 

A CC translates each locally issued request into one or more operations, called initiating operations. In the home update protocol, each request results in exactly one initiating operation. Other actions the memory system takes in executing a request depend on the request type:

• READ miss on variable v. The CC generates and schedules a read on v’s home copy (we describe scheduling later). On receiving the read, v’s home returns a read response. In the static protocol, a read response is simply a message, not an operation: there is no copy at the receiving CC on which the read response operates. On receiving the read response, the CC returns the value to p, the issuing processor.

• READ hit. The CC generates and schedules a read on v’s cache copy. (Recall that an operation on the local copy is a self-message and has a logical send and receive time.) At the logical receive time, the CC executes the read on its cache copy, returning the value to p.

• WRITE (hit or miss). The CC generates and schedules a write on v’s home copy. On receiving the write, v’s home assigns the value to the home copy and sends a write to every processor in v’s directory (including p if p is in v’s directory). Writes that the home sends are usually called updates. An own-update is an update a CC receives in response to its own write. On receiving an update, a CC assigns the value to the cache copy.

Implementing  a  dynamic  copyset 

We can adapt the protocol so it can create and destroy cache copies:⁵

• A CC destroys its copy of v by sending a release message to v’s home. The home executes the release by removing p from v’s directory.

• A CC creates a cache copy as a result of a miss. When the home receives an operation on v that p sends, it adds p to v’s directory. When a CC with no valid cache copy of v executes a read response or an own-update on v, it creates a cache copy.
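The home's role in request handling and copyset maintenance can be sketched as a small Python class. This is an untimed illustration under several assumptions: the class `Home` and its method names are hypothetical, the directory is a Python set rather than the bit vector the text assumes, and logical send and receive times are left out entirely.

```python
class Home:
    """Home memory module for one variable, with a set-based directory."""

    def __init__(self, value=0, dynamic=False, directory=()):
        self.value = value
        self.dynamic = dynamic            # True for the dynamic copyset version
        self.directory = set(directory)   # processors holding a cache copy

    def on_read(self, p):
        if self.dynamic:
            # The reader will create a cache copy from the read response.
            self.directory.add(p)
        return ("read_response", p, self.value)

    def on_write(self, p, value):
        self.value = value
        if self.dynamic:
            # The writer will create a cache copy from its own-update.
            self.directory.add(p)
        # One update per directory entry, including the writer if it is listed.
        return [("update", q, value) for q in sorted(self.directory)]

    def on_release(self, p):
        # A release removes the sender from the directory.
        self.directory.discard(p)


home = Home(dynamic=True)
response = home.on_read(1)
updates = home.on_write(2, 7)
home.on_release(1)
```

In a delta protocol each returned message would additionally be sent as a predictable response, which is what fixes its receive time relative to the operation that triggered it.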

Using  isotach  guarantees 

The home update protocol, as described so far, is similar to other update protocols. The protocol differs from others in its use of isotach guarantees.

The isotach invariant lets each CC control the effective execution time of requests by scheduling the send times of initiating operations. The scheduling algorithm uses this control to enforce SC:

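    lastR = 0;                     /* effective execution pulse of the last request scheduled */
    for each request R issued by p
        if R is a READ hit
            xdist = -dist(home, p);    /* d(op) = 0 and delta(c) = dist(home, p) */
        else
            xdist = dist(p, home);     /* d(op) = dist(p, home) and delta(home) = 0 */
        sendp = max(lastR - xdist, now);   /* send pulse of R's initiating operation */
        lastR = sendp + xdist;             /* effective execution pulse of R */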

The CC tracks lastR, the effective execution pulse of the last request it scheduled, and schedules the initiating operation of each new request so that its effective execution pulse is no less than lastR. The effective execution pulse is the pulse component of the effective execution time. A request with the same effective execution pulse as the previous request has a later effective execution time due to its rank component.
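A runnable reading of the scheduling rule is sketched below. It takes the execution distance from the definition xdist(op) = d(op) - δ(c), so a READ hit has xdist = -dist(home, p), and it records lastR as the send pulse plus xdist so that lastR is the effective execution pulse the text describes. The function name, the request tuples, and the distances are assumptions for the example; only pulse components are modeled.

```python
def schedule(requests, dist_p_home, dist_home_p):
    """Return (send_pulse, effective_pulse) per request, in issue order.

    requests: list of (kind, is_hit, now), where kind is "READ" or "WRITE"
    and `now` is the current pulse when the request is issued.
    """
    last_r = 0  # effective execution pulse of the last scheduled request
    plan = []
    for kind, is_hit, now in requests:
        if kind == "READ" and is_hit:
            xdist = -dist_home_p   # self-message on a cache copy
        else:
            xdist = dist_p_home    # operation executed on the home copy
        send = max(last_r - xdist, now)
        last_r = send + xdist
        plan.append((send, last_r))
    return plan


plan = schedule(
    [("WRITE", True, 10), ("READ", True, 10), ("WRITE", False, 11)],
    dist_p_home=4,
    dist_home_p=3,
)
```

In this trace the read hit is held until pulse 17 so that it does not appear to execute before the preceding write, while the second write is sent at pulse 11 without waiting for the read; effective execution pulses are 14, 14, and 15, which is nondecreasing in issue order.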

The home MM sends updates as predictable responses, allowing the definition of copy deltas and establishing the relationship between each variable’s cache and home copies. As we show later, sending updates as predictable responses also ensures that all writes resulting from the same WRITE have the same effective execution time, with the result that each WRITE appears to execute atomically. Although the execution times of the writes can differ, their effective execution times are identical because each copy’s delta exactly offsets the time required to propagate the update.

Sidebar: Static protocol

We show that for any execution E of any program P on a memory system M under the static protocol, there is an equivalent sequential execution S of P on a single-copy memory system M', where S is SC.

Definition. For any request R, the effective execution time of R is the effective execution time of the initiating operation resulting from R.

Lemma 1. The effective execution times of requests derived from execution E of program P define a total order over the requests in P.

Proof. Because each logical time is a three-tuple in which the second and third components serve as tie-breakers, each initiating operation for a request in E has a unique effective execution time.

Definition. For any program P, let P' be the permutation of P in which the requests in P appear in increasing order by their effective execution times.

Definition. Let S be the execution of P' in which the requests in P' are executed on M' in the order in which the requests appear in P'.

Lemma 2. All operations resulting from the same request have the same effective execution time.

Proof. Because a read results in only one operation, the claim is trivially true for reads. A write results in an initiating write executed on the home copy and an update on each cache copy. The effective execution time of the initiating write w is $t_r(w)$, because the write is executed on the home copy and the delta of the home copy is zero. For any update u the home sends to copy c' in response to w, $t_{effx}(u) = t_r(u) - \delta(c')$. Because u is a predictable response to w, $t_r(u) = t_r(w) + dist(home, c')$. Thus, $t_{effx}(u) = t_r(w) + dist(home, c') - \delta(c')$. Because $\delta(c') = dist(home, c')$, $t_{effx}(u) = t_{effx}(w)$.

Lemma 3. For any copy c in E or S and any two operations op and op' executed on c, op and op' are executed in the order of their effective execution times.

Proof. For all operations that M executes on any copy c in E, the difference between the effective execution time and the execution time is the same, $\delta(c)$. Thus, $t_{effx}(op)$ is less than $t_{effx}(op')$ if and only if $t_x(op)$ is less than $t_x(op')$; in other words, M executes operations on c in increasing order by their effective execution times. By definition of S, M' executes all operations in S, including any operations executed on any given copy c, in increasing order of their effective execution times.

Lemma 4. For any two requests R and R', if R is executed on copy c before R', then R is executed before R' on every copy on which both R and R' are executed.

Proof. By Lemmas 2 and 3.

Lemma 5. Every read in P returns the same value in S and E.

Proof. Consider any read R' on any variable v in P. M executes R' on exactly one copy c. Let W be the immediately preceding write on c, that is, the write that assigns the value M returns in response to R'. Because every request M executes in E is executed by M' in S, M' executes W and R' on the copy of v in S. By Lemma 4, because M executes W before R' on c in E, M' executes W before R' on the copy of v in S. We show by contradiction that there is no intervening write W' between W and R' on the copy in S. Because, in the static protocol, M executes every write on v on every copy of v, M executes W' on c. By Lemma 4, M executes W' on c between W and R', contradicting the assumption that W is the immediately preceding write for R' in E. Thus, the same write is the immediately preceding write for R' in E and S, and E and S return the same value for R'.

Lemma 6. Execution S is SC.

Proof. Consider any two requests R and R' issued by the same processor p, where R is issued before R'. Let op be the initiating operation for R and op' be the initiating operation for R'. By the scheduling algorithm, the CC chooses $t_s(op')$ such that $t_{effx}(op') > t_{effx}(op)$. Thus, R' appears after R in P' and is executed after R in S.

Theorem. The static protocol is correct.

Proof. By Lemma 5, E and S are equivalent. By Lemma 6, S is SC. Thus, the result of any execution on a memory system using the static protocol is the same as if it were executed on M' in some sequential order consistent with the program order.

Proof of the dynamic protocol requires showing that every read operation is executed on a copy that has received the immediately preceding write.¹
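A small numeric check shows how each copy's delta cancels the propagation time of an update. The sketch below is illustrative only: the function name and the distances are made up, only pulse components are modeled, and the response constant c is taken to be zero as the text assumes.

```python
def write_pulses(write_send, dist_p_home, dist_home_to_copy):
    """Map each copy to (execution_pulse, effective_pulse) for one WRITE."""
    # The initiating write executes on the home copy, whose delta is zero.
    home_exec = write_send + dist_p_home
    pulses = {"home": (home_exec, home_exec)}
    for copy, d in dist_home_to_copy.items():
        # Update sent as a predictable response: received d pulses later.
        update_exec = home_exec + d
        # delta(copy) = dist(home, copy) = d, so the offset cancels.
        pulses[copy] = (update_exec, update_exec - d)
    return pulses


pulses = write_pulses(write_send=10, dist_p_home=4,
                      dist_home_to_copy={"c1": 3, "c2": 7})
```

The execution pulses differ across copies (14, 17, and 21 here), but every operation resulting from the WRITE has effective execution pulse 14.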

Correctness 

We prove the correctness of the static protocol in the “Static protocol” sidebar. A proof of the full protocol appears elsewhere.⁵ We show that the result of any execution under the protocol is the same as it would be if it were executed on memory system M', which is known to be correct. Memory system M' executes requests serially in some sequential order on a single-copy memory, translating each request into a single operation. We show that for any execution E of any program P on a memory system that uses the static protocol, there is an equivalent sequential execution S of P on M', where S is SC. A program is a sequence of requests, in the order in which they are submitted, and an execution of program P is the sequence of operations resulting from P in the order in which they are executed. Executions S and E of P are equivalent if every READ in S returns the same value as the corresponding READ in E.


Atomicity 

We can adapt the home update protocol to execute batches of requests atomically. Exploiting this capability changes the programming model: instead of using locks or barriers to enforce atomicity, a processor issues batches of requests called isochrons and the memory system executes the requests in each isochron so that they appear to be executed at the same time. Because a processor must issue all the requests in an isochron as a batch, isochrons cannot contain internal data dependences. However, we can implement atomic actions with internal data dependences, using isochrons together with a class of operations called split operations.¹⁰

Adapting the protocol to enforce atomicity requires changing only the scheduling algorithm. Each CC schedules requests so that all requests in the same isochron have the same effective execution pulse, and schedules each isochron so that it has an effective execution pulse no less than the previously scheduled isochron. We show requests in each isochron are executed atomically by showing that the requests occur in the equivalent serial execution S as a contiguous subsequence.⁵ Because all requests in the same isochron have the same effective execution pulse and are issued by the same processor as a batch, no other request can have an intervening effective execution time.
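One way a CC might schedule an isochron is sketched below: choose a single effective execution pulse that every request in the batch can meet, then back out each send pulse from its execution distance. The function name and the example values are hypothetical, only pulse components are modeled, and picking the earliest feasible pulse is a choice made for this illustration rather than something the text specifies.

```python
def schedule_isochron(last_r, now, xdists):
    """Give every request in an isochron the same effective execution pulse.

    last_r: effective execution pulse of the previously scheduled isochron.
    xdists: execution distance of each request's initiating operation.
    Returns (send_pulses, effective_pulse).
    """
    # No operation can be sent before `now`, so the shared effective pulse
    # must be at least now + xdist for every request, and at least last_r.
    effective = max([last_r] + [now + x for x in xdists])
    sends = [effective - x for x in xdists]
    return sends, effective


# Two writes to homes at distances 4 and 6, and one read hit with delta 3.
sends, effective = schedule_isochron(last_r=14, now=12, xdists=[4, -3, 6])
```

With these values the farthest write is sent immediately at pulse 12, the nearer write at 14, and the read hit is held until pulse 21, so all three appear to execute in pulse 18.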


In delta protocols, each copy c has a delta, $\delta(c)$, equal to the number of logical pulses by which the copy lags behind the home copy. The deltas let nodes control the order in which requests appear to execute and facilitate proving delta protocols correct.

Delta coherence protocols use isotach guarantees to enforce SC with fewer restrictions on concurrency than existing protocols. First of all, under these protocols, the memory system can pipeline requests. Existing protocols that enforce SC require that the execution of a request not start until the execution of the previous request issued by the same processor completes.¹¹ (Sarita V. Adve and Mark D. Hill have proposed an SC protocol that lets nodes overlap the execution of a WRITE with another request, with a restriction that the effect of the second request cannot be visible to any node until after the WRITE is globally performed.¹²) Delta protocols can overlap the execution of requests, requiring only that a request not appear to complete before the previous request completes; in other words, that its effective execution time not precede that of the previous request.

Second, delta coherence protocols don’t require acknowledgments. Existing protocols use acknowledgments to inform a node when its WRITE completes. Relying on acknowledgments adds message traffic and, more importantly, increases latency: a node delays executing a request not just until the completion of the previous request, but until it receives acknowledgment of the completion. In delta protocols, a node determines from local information the completion time of each request before it sends the initiating operation.

Third, multiple processors can write the same variable concurrently. Invalidate protocols do not permit concurrent writes, though update protocols do, subject to the restriction that writes are not immediately readable.

Fourth, writes are immediately readable. In the absence of strong message-ordering guarantees, existing protocols that ensure SC cannot return the value of a read to a cache copy until the WRITE that supplied that value is globally performed, that is, until all cache copies are updated or invalidated.¹¹ This requirement is easy to satisfy in invalidation protocols but difficult in update protocols.

Finally, processors can execute multiple requests atomically without locks. Most existing protocols that enforce atomicity use two-phase locking. Alternatively, protocols can assign transactions timestamps and abort and restart any transactions that cannot be executed in timestamp order. Delta protocols let a processor access multiple variables atomically without locks or restarts. Processors can execute isochrons without synchronizing or obtaining exclusive access to the variables accessed.

Delta protocols offer a significantly higher level of concurrency than existing coherence protocols, while a prototype isotach network implementation demonstrates that the cost of providing this additional concurrency is low. We expect delta protocols to be useful in applications that maintain many copies of data items or in which data contention is high.